The two datasets I have used before on Project 1. I have chosen County Health Rankings in Texas and COVID-19 vaccinations by county in Texas. The variables contained in the County Health Rankings in Texas come partially from my Biostatistics project: county, food environment index, primary care physician population, and urban/rural classification. The variables contained in COVID-19 vaccinations by county in Texas deal with counties, the population who is eligible for vaccination (people 12 and older), and population of fully vaccinated individuals. The food environment index was calculated by measuring access to healthy foods and the income of counties and primary care physician population was found by looking at ratios of population to primary care physicians as well as including D.O.s. Population data was aquired by the 2019 US Census Bureau Population Estimates and the population who have been fully vaccinated or partially vaccinated was collected through vaccination records submitted by health care providers.
This data is interesting to me because I want to see if there’s any sort of relationship to be found within between the two datasets. The food environment index takes into account access to supermarkets and grocery stores, which usually also have pharmacies as well where vaccinations can be given. There are 254 observations total. There are some observations with NA values, when they are removed the total is 232. Of those 254 observations, 172 are rural and 82 are urban! Of those the set with no NA values, there are 156 that are rural and 76 that are urban.
library(tidyverse)
countyhealth <- read_csv("~/Data from County Health Rankings and Biostats project.csv")
countyvaccines <- read_csv("~/COVID-19 Vaccine Data by County.csv")
combinedcounty <- inner_join(countyhealth, countyvaccines, by = "county")
combinedcountyfinal <- combinedcounty %>% mutate(percentage.fully.vaccinated = (people.fully.vaccinated/population.12.and.older) *
100, percentage.vaccinated.with.at.least.one.dose = (people.vaccinated.with.at.least.one.dose/population.12.and.older) *
100) %>% select(-total.doses.allocated, -vaccine.doses.administered,
-people.fully.vaccinated, -people.vaccinated.with.at.least.one.dose)
combinedcountyfinal <- combinedcountyfinal %>% mutate(urban = ifelse(urban.rural.classification ==
"Urban", 1, 0))
combinedcountyfinalnona <- combinedcountyfinal %>% na.omit()
combinedcountyfinalnona %>% group_by(urban.rural.classification) %>%
summarize(count = n())
## # A tibble: 2 x 2
## urban.rural.classification count
## <chr> <int>
## 1 Rural 156
## 2 Urban 76
library(cluster)
clust_dat <- combinedcountyfinal %>% dplyr::select(food.environment.index,
percentage.fully.vaccinated, primary.care.physician.population) %>%
na.omit()
clust_dat <- clust_dat %>% na.omit() %>% scale
sil_width <- vector()
for (i in 2:10) {
pam_fit <- pam(clust_dat, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
final <- clust_dat %>% as.data.frame
pam1 <- final %>% pam(4)
pam1
## Medoids:
## ID food.environment.index percentage.fully.vaccinated
## [1,] 23 -0.0933065 -0.6051163
## [2,] 33 0.7322188 0.2688540
## [3,] 200 -1.8360822 0.5711513
## [4,] 205 0.7322188 2.0288380
## primary.care.physician.population
## [1,] -0.1934691
## [2,] -0.2515207
## [3,] -0.2732901
## [4,] 3.6379412
## Clustering vector:
## [1] 1 2 1 3 1 2 2 2 1 2 1 1 1 4 2 1 2 1 2 2 2 3 1 2 2 2 2 1 3 1 1 2 2 1 1 1 1
## [38] 1 4 1 2 2 1 2 1 1 3 2 2 2 1 4 1 2 1 2 2 1 3 1 2 1 2 4 1 1 1 2 1 2 3 4 1 1
## [75] 2 1 2 2 2 1 2 1 1 1 1 2 1 1 2 3 1 1 4 1 1 2 2 1 3 1
## [ reached getOption("max.print") -- omitted 132 entries ]
## Objective function:
## build swap
## 0.8680776 0.8250323
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
final <- final %>% mutate(cluster = as.factor(pam1$clustering))
ggplot(final, aes(x = food.environment.index, y = percentage.fully.vaccinated,
color = cluster)) + geom_point()
final %>% slice(pam1$id.med)
## food.environment.index percentage.fully.vaccinated
## 1 -0.0933065 -0.6051163
## 2 0.7322188 0.2688540
## 3 -1.8360822 0.5711513
## 4 0.7322188 2.0288380
## primary.care.physician.population cluster
## 1 -0.1934691 1
## 2 -0.2515207 2
## 3 -0.2732901 3
## 4 3.6379412 4
library(plotly)
final %>% plot_ly(x = ~food.environment.index, y = ~percentage.fully.vaccinated,
z = ~primary.care.physician.population, color = ~cluster,
type = "scatter3d", mode = "markers")
plot(pam1, which = 2)
library(GGally)
ggpairs(final, aes(color = cluster))
When I scaled and standardized all the data, I thought it was interesting to see that there were negative values within my data. I chose to do 4 clusters as it had the highest silhouette width. Cluster 4 seems to have all high values of food.environment.index, percentage of fully vaccinated people, and primary care physicians. Cluster 3 seems to have really low food environment index, medium level percentage of fully vaccinated people, and somewhat low primary care physicians. Cluster 2 has a medium food environment index, medium percentage of fully vaccinated people, and a low population of primary care physicians. Cluster 1 has low values of food environment index, fully vaccinated people, and primary care physicians. In the end, I don’t think the goodness-of-fit was that great, the highest average silhouette width was only .34 which means the structure is weak and may be artificial.
countyfinal <- combinedcountyfinal %>% na.omit() %>% select(-urban) %>%
select_if(is.numeric) %>% scale
rownames(countyfinal) <- combinedcountyfinalnona$county
county_pca <- princomp(countyfinal, cor = T)
names(county_pca)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
summary(county_pca, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.624027 1.1832634 0.9532786 0.201300882 0.114720132
## Proportion of Variance 0.527493 0.2800224 0.1817480 0.008104409 0.002632142
## Cumulative Proportion 0.527493 0.8075154 0.9892634 0.997367858 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## food.environment.index 0.428 0.903
## primary.care.physician.population 0.507 0.440 -0.220 -0.708
## population.12.and.older 0.507 0.438 -0.228 0.705
## percentage.fully.vaccinated 0.499 -0.452 0.199 -0.711
## percentage.vaccinated.with.at.least.one.dose 0.486 -0.477 0.209 0.702
countydf <- data.frame(County = combinedcountyfinalnona$county,
PC1 = county_pca$scores[, 1], PC2 = county_pca$scores[, 2])
ggplot(countydf, aes(PC1, PC2)) + geom_point()
I decided to retain PC1 and PC2 as they were enough to explain about 81% of the total variance in the dataset. For PC1, a high score means higher population of people 12 and older, as well as primary care physicians. It also means a larger percentage that are vaccinated with one dose and fully vaccinated. A lower score for PC1 means a lower population of people 12 and older and lower number of primary care physicians. It also means a lower percentage that are vaccinated with one dose and fully vaccinated. Interestingly, the food environment index does not play a big role in PC1.
For PC2 a high score means high food environment index, primary care physician population, population 12 and older, but a lower percentage fully vaccinated and lower percentage vaccinated with at least one dose. A low score for PC2 means lower food environment index, primary care physician, and population of people 12 and older. It also means higher percentage of people vaccinated with at least one dose (and fully vaccinated).
library(caret)
fit <- glm(urban ~ food.environment.index + primary.care.physician.population +
population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose,
data = combinedcountyfinalnona, family = "binomial")
coef(fit)
## (Intercept)
## -7.8100830137
## food.environment.index
## 0.6230329258
## primary.care.physician.population
## -0.0221492676
## population.12.and.older
## 0.0000673533
## percentage.fully.vaccinated
## 0.1191668386
## percentage.vaccinated.with.at.least.one.dose
## -0.0928097189
probs <- predict(fit, type = "response")
probs %>% round(3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.441 0.192 0.645 0.084 0.041 0.545 0.359 0.108 0.147 0.851 0.083 0.177 1.000
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1.000 0.214 0.050 0.159 0.628 1.000 0.998 0.064 0.009 0.228 0.214 0.401 0.393
## 27 28 29 30 31 32 33 34 35 36 37 38 39
## 0.138 0.098 1.000 0.069 0.181 0.098 0.457 0.398 0.063 0.076 0.062 0.065 1.000
## 40 41 42 43 44 45 46 47 48 49 50 51 52
## 0.074 0.242 0.993 0.118 0.078 0.358 0.617 0.004 0.090 0.115 0.105 0.025 1.000
## 53 54 55 56 57 58 59 60 61 62 63 64 65
## 0.049 0.177 0.059 1.000 0.174 0.053 0.005 0.073 0.988 0.071 0.998 1.000 0.226
## 66 67 68 69 70 71 72 73 74 75 76 77 78
## 0.087 0.251 0.233 0.080 0.120 0.007 1.000 0.122 0.112 0.207 0.208 1.000 0.170
## 79 80 81 82 83 84 85 86 87 88 89 90 91
## 0.219 0.052 0.176 0.149 0.971 0.830 0.157 0.998 0.248 0.037 0.089 0.023 0.057
## 92 93 94 95 96 97 98 99 100
## 0.492 1.000 0.663 0.031 1.000 0.127 0.728 1.000 0.318
## [ reached getOption("max.print") -- omitted 132 entries ]
class_diag(probs, combinedcountyfinalnona$urban, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.8922 0.7368 0.9679 0.918 0.8175 0.8524 0.881
table(truth = combinedcountyfinalnona$urban, prediction = probs >
0.5) %>% addmargins
## prediction
## truth FALSE TRUE Sum
## 0 151 5 156
## 1 20 56 76
## Sum 171 61 232
cv <- trainControl(method = "cv", number = 8, classProbs = T,
savePredictions = T)
fit <- train(urban ~ food.environment.index + primary.care.physician.population +
population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose,
data = combinedcountyfinalnona, trControl = cv, method = "glm")
class_diag(fit$pred$pred, fit$pred$obs, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7716 0.3158 0.9936 0.96 0.4752 0.6547 0.7613
The values is 1 for urban and 0 for rural. The model is pretty fair in predicting the new observations per CV AUC. There is definitely signs of overfitting as the CV AUC is less when compared to the regular AUC. The confusion matrix is interesting in that it mostly predicts urban counties correctly, but utterly fails as it tries to predicts rural counties.
library(caret)
knn_fit <- knn3(urban ~ food.environment.index + primary.care.physician.population +
population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose,
data = combinedcountyfinalnona, k = 8)
y_hat_knn <- predict(knn_fit, combinedcountyfinalnona)
y_hat_knn
## 0 1
## [1,] 0.500 0.500
## [2,] 0.750 0.250
## [3,] 0.375 0.625
## [4,] 0.750 0.250
## [5,] 0.750 0.250
## [6,] 0.625 0.375
## [7,] 0.875 0.125
## [8,] 1.000 0.000
## [9,] 0.750 0.250
## [10,] 0.375 0.625
## [11,] 1.000 0.000
## [12,] 1.000 0.000
## [13,] 0.000 1.000
## [14,] 0.000 1.000
## [15,] 1.000 0.000
## [16,] 1.000 0.000
## [17,] 0.875 0.125
## [18,] 0.250 0.750
## [19,] 0.000 1.000
## [20,] 0.000 1.000
## [21,] 0.875 0.125
## [22,] 0.750 0.250
## [23,] 1.000 0.000
## [24,] 0.875 0.125
## [25,] 0.625 0.375
## [26,] 0.625 0.375
## [27,] 0.750 0.250
## [28,] 0.875 0.125
## [29,] 0.000 1.000
## [30,] 1.000 0.000
## [31,] 0.875 0.125
## [32,] 1.000 0.000
## [33,] 0.625 0.375
## [34,] 0.625 0.375
## [35,] 1.000 0.000
## [36,] 0.875 0.125
## [37,] 1.000 0.000
## [38,] 1.000 0.000
## [39,] 0.000 1.000
## [40,] 1.000 0.000
## [41,] 0.750 0.250
## [42,] 0.000 1.000
## [43,] 0.875 0.125
## [44,] 1.000 0.000
## [45,] 0.625 0.375
## [46,] 0.500 0.500
## [47,] 0.875 0.125
## [48,] 1.000 0.000
## [49,] 1.000 0.000
## [50,] 0.625 0.375
## [ reached getOption("max.print") -- omitted 182 rows ]
class_diag(y_hat_knn[, 2], combinedcountyfinalnona$urban, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.8578 0.6053 0.9808 0.9388 0.736 0.793 0.9219
table(truth = factor(combinedcountyfinalnona$urban == 1, levels = c("TRUE",
"FALSE")), prediction = factor(y_hat_knn[, 2] > 0.5, levels = c("TRUE",
"FALSE"))) %>% addmargins
## prediction
## truth TRUE FALSE Sum
## TRUE 46 30 76
## FALSE 3 153 156
## Sum 49 183 232
cv <- trainControl(method = "cv", number = 8, classProbs = T,
savePredictions = T)
fit <- train(urban ~ food.environment.index + primary.care.physician.population +
population.12.and.older + percentage.fully.vaccinated + percentage.vaccinated.with.at.least.one.dose,
data = combinedcountyfinalnona, trControl = cv, method = "knn")
class_diag(fit$pred$pred, fit$pred$obs, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.8434 0.636 0.9444 0.848 0.7268 0.7902 0.839
The model is good at predicting new observations per CV AUC. There are signs of overfitting as the CV AUC is less than the regular AUC. The nonparametric model is actually better than the linear model in its cross-validation performance, the AUC is higher. The confusion matrix predicted more rural counties correctly than the linear classifier confusion matrix.
fit <- lm(percentage.fully.vaccinated ~ primary.care.physician.population +
food.environment.index, data = combinedcountyfinalnona)
yhat <- predict(fit)
mean((combinedcountyfinalnona$percentage.fully.vaccinated - yhat)^2)
## [1] 113.1287
k = 8
data <- combinedcountyfinalnona[sample(nrow(combinedcountyfinalnona)),
]
folds <- cut(seq(1:nrow(combinedcountyfinalnona)), breaks = k,
labels = F)
diags <- NULL
for (i in 1:k) {
train <- data[folds != i, ]
test <- data[folds == i, ]
fit <- lm(percentage.fully.vaccinated ~ primary.care.physician.population +
food.environment.index, data = train)
yhat <- predict(fit, newdata = test)
diags <- mean((test$percentage.fully.vaccinated - yhat)^2)
}
mean(diags)
## [1] 160.6415
The MSE for the overall dataset was 113.1287. There is definitely signs of overfitting as the average MSE from my k testing folds were higher than the MSE for the overall dataset.
library(reticulate)
hi <- "Hello"
combinedcounty %>% mutate(percentage.fully.vaccinated = (people.fully.vaccinated/population.12.and.older) *
100, percentage.vaccinated.with.at.least.one.dose = (people.vaccinated.with.at.least.one.dose/population.12.and.older) *
100)
## # A tibble: 254 x 11
## county food.environmen… primary.care.ph… urban.rural.cla… total.doses.all…
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Ander… 6.4 19 Rural 17870
## 2 Andre… 7.8 9 Rural 6100
## 3 Angel… 6.3 51 Rural 58100
## 4 Arans… 5.4 9 Urban 6840
## 5 Archer 7.7 NA Urban 3200
## 6 Armst… 6.7 0 Urban 2100
## 7 Atasc… 7.6 11 Urban 19000
## 8 Austin 7.9 5 Urban 8700
## 9 Bailey 7.9 2 Rural 24135
## 10 Bande… 6.6 5 Urban 5100
## # … with 244 more rows, and 6 more variables: vaccine.doses.administered <dbl>,
## # people.vaccinated.with.at.least.one.dose <dbl>,
## # people.fully.vaccinated <dbl>, population.12.and.older <dbl>,
## # percentage.fully.vaccinated <dbl>,
## # percentage.vaccinated.with.at.least.one.dose <dbl>
import pandas as pd
(r.combinedcounty.assign(PercentageofFullyVaccinated = lambda x: (r.combinedcounty["people.fully.vaccinated"]/r.combinedcounty["population.12.and.older"])*100)
.assign(PercentageofVaccinatedatleastonedose= lambda x:(r.combinedcounty["people.vaccinated.with.at.least.one.dose"]/r.combinedcounty["population.12.and.older"])*100)
.sort_values(by=('PercentageofFullyVaccinated'), ascending=True))
## county ... PercentageofVaccinatedatleastonedose
## 134 King ... 26.976744
## 150 Loving ... 31.538462
## 175 Newton ... 29.357335
## 82 Gaines ... 29.461144
## 172 Motley ... 41.484301
## 147 Lipscomb ... 36.100815
## 196 Roberts ... 36.301370
## 43 Collingsworth ... 39.761513
## 62 Dickens ... 37.982501
## 16 Borden ... 38.488576
## 148 Live Oak ... 39.943209
## 203 San Jacinto ... 40.684029
## 210 Sherman ... 46.287328
## 158 Martin ... 44.198524
## 179 Oldham ... 40.918691
## 32 Carson ... 41.651883
## 33 Cass ... 41.760810
## 229 Upshur ... 41.968618
## 59 Delta ... 41.644325
## 66 Eastland ... 43.321368
## 103 Haskell ... 45.226329
## 113 Howard ... 42.378078
## 116 Hutchinson ... 43.139622
## 18 Bowie ... 44.025636
## 57 Dawson ... 43.875895
## 205 San Saba ... 43.947717
## 201 Sabine ... 42.878513
## 214 Stephens ... 42.430651
## 168 Montague ... 44.416032
## 189 Rains ... 43.239489
## .. ... ... ...
## 56 Dallas ... 73.642414
## 45 Comal ... 72.816969
## 185 Pecos ... 74.151376
## 83 Galveston ... 71.119941
## 104 Hays ... 74.881504
## 123 Jim Hogg ... 81.373265
## 60 Denton ... 72.633149
## 129 Kendall ... 74.194345
## 141 La Salle ... 76.269878
## 136 Kleberg ... 76.823952
## 100 Harris ... 78.031637
## 14 Bexar ... 79.242652
## 244 Willacy ... 79.894824
## 65 Duval ... 81.879560
## 232 Val Verde ... 87.695806
## 114 Hudspeth ... 79.729403
## 23 Brooks ... 81.081554
## 63 Dimmit ... 106.534750
## 245 Williamson ... 80.266381
## 252 Zapata ... 82.400146
## 42 Collin ... 80.228136
## 226 Travis ... 82.485059
## 78 Fort Bend ... 86.433031
## 70 El Paso ... 90.212661
## 107 Hidalgo ... 94.436963
## 30 Cameron ... 95.904598
## 161 Maverick ... 114.542086
## 213 Starr ... 99.807476
## 239 Webb ... 112.245796
## 188 Presidio ... 119.195104
##
## [254 rows x 11 columns]
In python, I was able to recalculate my dataset’s percentage of people fully vaccinated and percentage of people vaccinated with one dose through Python’s version of tidyverse!